Project

Author: Qi Meng Last modified data: 2018-02-25 ========================================================

The diamention of Proserper Loan Dataset

# length(pr)
dim(pr)
## [1] 113937     81

In total there are 81 variables that corresponding to each loan In order better understand the dataset, in this analysis, 19 variables will be selected.

pr <- subset(pr, select = c('LoanStatus',
                                'BorrowerAPR',
                                'BorrowerRate',
                                'LenderYield',
                                'ProsperScore',
                                'BorrowerState',
                                'Occupation',
                                'EmploymentStatus',
                                'IsBorrowerHomeowner',
                                'TotalCreditLinespast7years',
                                'TotalInquiries',
                                'BankcardUtilization',
                                'AvailableBankcardCredit',
                                'IncomeRange',
                                'IncomeVerifiable',
                                'LoanOriginalAmount',
                                'LoanOriginationDate',
                                'MonthlyLoanPayment',
                                'Investors'
))
str(pr)
## 'data.frame':    113937 obs. of  19 variables:
##  $ LoanStatus                : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerAPR               : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate              : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield               : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ ProsperScore              : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ BorrowerState             : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus          : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ IsBorrowerHomeowner       : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ TotalCreditLinespast7years: int  12 29 3 29 49 49 20 10 32 32 ...
##  $ TotalInquiries            : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ BankcardUtilization       : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit   : num  1500 10266 NA 30754 695 ...
##  $ IncomeRange               : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable          : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LoanOriginalAmount        : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate       : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ MonthlyLoanPayment        : num  330 319 123 321 564 ...
##  $ Investors                 : int  258 1 41 158 20 1 1 1 1 1 ...

Univariate Plots Section

# table(pr$ProsperScore)


ggplot(aes(ProsperScore), data = pr) +
  scale_x_continuous(breaks = seq(1, 11, 1) ) +
  geom_histogram(stat = 'count', binwidth = 1)  
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 29084 rows containing non-finite values (stat_count).

# From the histogram, the majority of the ProserScore are 4, 6, and 8

From the histogram, the majority of the ProserScore are 4, 6, and 8

ggplot(aes(LoanStatus), data = pr) +
    geom_histogram(stat = 'count')    + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad

Most of the loan are current and the second is the completed loans

range_list <- c('$0','$1-24,999','$25,000-49,999','$50,000-74,999','$75,000-99,999',
                '$100,000+','Not employed','Not displayed')
pr$income_range <- factor(pr$IncomeRange,levels = range_list)

Create a new variabel income_range to better format the IncomeRange variable

ggplot(aes(pr$EmploymentStatus), data = pr) +
    geom_histogram(stat = 'count')    + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad

The loan data mostly coming from people who works employed or full time.

ggplot(aes(pr$income_range), data = pr) +
    geom_histogram(stat = 'count')    + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Ignoring unknown parameters: binwidth, bins, pad

People whose income is between $25,000 and $100,000 applied for the loans.

Univariate Analysis

What is the structure of your dataset?

The ProsperLoan data have 113937 overvations and 19 variables ProsperSocre: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009. Loanstatus: Completed, Current, Past Due (1-15 days), Defaulted, Chargedoff, Past Due (16-30 days), Cancelled, Past Due (61-90 days), Past Due (31-60 days), Past Due (91-120 days)

EmploymentStatus:Self-employed, Employed, Not available, Full-time,, Other, Not employed, Part-time, Retired
IncomeRange: $25,000-49,999, $50,000-74,999, Not displayed, $100,000+, $75,000-99,999, $1-24,999,Not employed, $0
$0 $1-24,999, $100,000+, $25,000-49,999, $50,000-74,999, $75,000-99,999, Not displayed, Not employed

What is/are the main feature(s) of interest in your dataset?

The main feature is the ProsperSocre, which measue the risk ability of the loan itself, verus the BorrowerRate

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The loan applicants’ occupation,income, bankcard utilization, available bank card credit and other variables might impact the rick score when valued by th Prosper Company

Did you create any new variables from existing variables in the dataset?

Yes, I created the new variable range_new to reorder the income range variable in a ascending order.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Yes. I did select 19 variables out of 81 in total. The reason for this is that not all the variables are revelent in determining the Prosper Score.

Bivariate Plots Section

ggplot(aes(income_range, ProsperScore ), data = subset(pr, !is.na(ProsperScore))) +
  geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) 

For applicants with lower income, the Prosper Score seems lower than 6, and for people with higher income, the ProsperScore seems higher, which means less risk.

ggplot(aes(income_range, ProsperScore ), data = subset(pr, !is.na(ProsperScore))) +
  geom_boxplot()+
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) 

From the box plot we can see it more clearly, the median of people of income greater than $50,000 is much higher than people with income less than $50,000

ggplot(aes(EmploymentStatus, ProsperScore ), data = subset(pr, !is.na(ProsperScore))) +
  geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) 

If we compare ProsperScore with employment status, clearly, full time employees will have greater ProperScore, and therefore, less risky to lend money to them

ggplot(aes( ProsperScore, BorrowerRate * 100 ), data = subset(pr, !is.na(ProsperScore))) +
  geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
  stat_smooth(method = 'lm')

Unsperisingly, applicants with higher prosperscore seem to have lower borrow rate.

with(pr, cor.test(pr$ProsperScore, pr$BorrowerAPR))
## 
##  Pearson's product-moment correlation
## 
## data:  pr$ProsperScore and pr$BorrowerAPR
## t = -261.68, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6719940 -0.6645469
## sample estimates:
##        cor 
## -0.6682872
ggplot(aes( EmploymentStatus, BorrowerRate * 100 ), data = subset(pr, !is.na(ProsperScore))) +
  geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
  stat_smooth(method = 'lm')

ggplot(aes(pr$AvailableBankcardCredit/1000, pr$BorrowerRate*100 ), 
  data = subset(pr, (!is.na(BorrowerRate)))) +
  # data = na.omit(pr)) +
  geom_jitter(alpha =.5, shape = 21, colour = '#F79420') +
   # xlim((quantile(AvailableBankcardCredit, 0.1), quantile(AvailableBankcardCredit, 0.80))) +
  scale_x_continuous(limits = c(1, 150),breaks = seq(1, 150, 20) ) +
  ylim(5,40)
## Warning: Removed 36065 rows containing missing values (geom_point).

Higer available bank card credit will also have lower borrow rate.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Other than main feature, I also noticed applicants with higher available bank card credit will also have lower borrow rate.

What was the strongest relationship you found?

Borrow rate verus the properscore, The coefficient between them is -0.6682872.

Multivariate Plots Section

ggplot(aes( BorrowerRate,ProsperScore), data = subset(pr, !is.na(ProsperScore)) ) +
  geom_jitter(alpha = .05, shape = 21, colour = '#F79420') +
  # theme(axis.text.x = element_text(angle = 60, hjust = 1))+
  facet_wrap(~income_range)

ggplot(aes(BorrowerRate,ProsperScore ), data = subset(pr, !is.na(ProsperScore))) +
  geom_jitter(alpha = .05, shape = 21, colour = '#099DD9') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  facet_wrap(~EmploymentStatus)

The same pattern also occur for different employment status

ggplot(aes( ProsperScore,BorrowerRate* 100, colour = income_range), data = pr) + 
  geom_point(alpha = .5, size = .5, position = 'jitter') +
  scale_color_brewer(type =  'div', guide = guide_legend(title = 'income_range', 
                                                         reverse = FALSE, 
                                                         override.aes = list(alpha = 1, size = 2))) +
  scale_y_continuous(limits = c(quantile(pr$BorrowerRate * 100, .05), quantile(pr$BorrowerRate * 100, .90)))
## Warning: Removed 42364 rows containing missing values (geom_point).

When comparing with the main relationship ProsperScore and BorrowerRate, we also noticed a liner relationship adding the category variable income range

ggplot(aes(BorrowerRate*100, ProsperScore, colour = EmploymentStatus), data = pr) + 
  geom_point(alpha = .5, size = .75, position = 'jitter') +
  scale_color_brewer(type =  'div', guide = guide_legend(title = 'EmploymentStatus', 
                                                         reverse = FALSE, 
                                                         override.aes = list(alpha = 1, size = 2))) +
  scale_x_continuous(limits = c(quantile(pr$BorrowerRate* 100, .05), quantile(pr$BorrowerRate*100, .90)))
## Warning: Removed 42343 rows containing missing values (geom_point).

m1 <- lm(I(BorrowerRate) ~ 0 + I(ProsperScore), data = subset(pr, !is.na(ProsperScore)))
m2 <- update(m1, ~ . + AvailableBankcardCredit)
m3 <- update(m2, ~ . + IncomeRange)
m4 <- update(m3, ~ . + EmploymentStatus)
m5 <- update(m4, ~ . + TotalCreditLinespast7years)

mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ 0 + I(ProsperScore), data = subset(pr, 
##     !is.na(ProsperScore)))
## m2: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit - 
##     1, data = subset(pr, !is.na(ProsperScore)))
## m3: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit + 
##     IncomeRange - 1, data = subset(pr, !is.na(ProsperScore)))
## m4: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit + 
##     IncomeRange + EmploymentStatus - 1, data = subset(pr, !is.na(ProsperScore)))
## m5: lm(formula = I(BorrowerRate) ~ I(ProsperScore) + AvailableBankcardCredit + 
##     IncomeRange + EmploymentStatus + TotalCreditLinespast7years - 
##     1, data = subset(pr, !is.na(ProsperScore)))
## 
## ============================================================================================================================
##                                                   m1              m2              m3              m4              m5        
## ----------------------------------------------------------------------------------------------------------------------------
##   I(ProsperScore)                                 0.026***        0.028***       -0.018***       -0.019***       -0.019***  
##                                                  (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   AvailableBankcardCredit                                        -0.000***       -0.000***       -0.000***       -0.000***  
##                                                                  (0.000)         (0.000)         (0.000)         (0.000)    
##   IncomeRange: $0                                                                 0.352***        0.348***        0.348***  
##                                                                                  (0.008)         (0.008)         (0.008)    
##   IncomeRange: $1-24,999                                                          0.334***        0.336***        0.336***  
##                                                                                  (0.001)         (0.001)         (0.001)    
##   IncomeRange: $100,000+                                                          0.304***        0.307***        0.307***  
##                                                                                  (0.001)         (0.001)         (0.001)    
##   IncomeRange: $25,000-49,999                                                     0.317***        0.319***        0.318***  
##                                                                                  (0.001)         (0.001)         (0.001)    
##   IncomeRange: $50,000-74,999                                                     0.308***        0.310***        0.309***  
##                                                                                  (0.001)         (0.001)         (0.001)    
##   IncomeRange: $75,000-99,999                                                     0.306***        0.309***        0.308***  
##                                                                                  (0.001)         (0.001)         (0.001)    
##   IncomeRange: Not employed                                                       0.364***        0.368***        0.367***  
##                                                                                  (0.002)         (0.002)         (0.002)    
##   EmploymentStatus: Full-time/Employed                                                            0.025***        0.025***  
##                                                                                                  (0.001)         (0.001)    
##   EmploymentStatus: Other/Employed                                                               -0.003**        -0.003**   
##                                                                                                  (0.001)         (0.001)    
##   EmploymentStatus: Part-time/Employed                                                            0.016***        0.016***  
##                                                                                                  (0.003)         (0.003)    
##   EmploymentStatus: Retired/Employed                                                              0.020***        0.020***  
##                                                                                                  (0.003)         (0.003)    
##   EmploymentStatus: Self-employed/Employed                                                       -0.009***       -0.009***  
##                                                                                                  (0.001)         (0.001)    
##   TotalCreditLinespast7years                                                                                      0.000     
##                                                                                                                  (0.000)    
## ----------------------------------------------------------------------------------------------------------------------------
##   R-squared                                       0.612           0.621           0.932           0.933           0.933     
##   adj. R-squared                                  0.612           0.621           0.932           0.933           0.933     
##   sigma                                           0.131           0.129           0.055           0.054           0.054     
##   F                                          133697.431       69606.628      128923.549       84690.770       79044.868     
##   p                                               0.000           0.000           0.000           0.000           0.000     
##   Log-likelihood                              52266.161       53323.957      126092.434      126950.145      126950.686     
##   Deviance                                     1449.358        1413.669         254.367         249.276         249.273     
##   AIC                                       -104528.323     -106641.915     -252164.869     -253870.290     -253869.372     
##   BIC                                       -104509.626     -106613.869     -252071.382     -253730.059     -253719.793     
##   N                                           84853           84853           84853           84853           84853         
## ============================================================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

From multivariate analysis, the borrower’s rate is also affacted by other variables such as income range, employment status, aviable bank credits.

Were there any interesting or surprising interactions between features?

Holding the borrower’s rate constant, the employed status will have lower prosper score.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I did. For the linear models I created, R square is about 93%, which means almost 93% percent of the variation can be explained by the model. I also excluded the intercept, which strenthed the linear relationship between the variables.


Final Plots and Summary

Plot One

## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 29084 rows containing non-finite values (stat_count).

Description One

The loans were valued by the risk levels which bing called as ProsperScore, and the greater the score the lower the risk. Histogram showed us the counts for different ProsperScore.Majority of the loans have the score between 4 and 9.

Plot Two

Description Two

Ggplot gives us the negative correlation between ProsperScore and the borrower’s rate, the higher the score seems to lead to a lower borrower’s rate.

Plot Three

## Warning: Removed 42348 rows containing missing values (geom_point).

Description Three

When considering other variables, for example, income range, will also help us in predicting the borrower’s rate.


Reflection

Prosper Loan dataset has thorogh loan data regrading theri unique attributes, and when evaluating the loan applications, these variables could benefit the company in deciding the accurate rate. From the the analysis, it is shown that borrower’s rate the highly correlated with borrower’s ProsperScore, which measured the risk of the applicant. We also learned that other factors such as income level, employment status, avaible bank credits could also affect borrower’s rate. More thorough demographic data can be included in the dataset, therefore, we can better the detailed attributs of the applicants.